STAT 331: Final Project

Author

Ethan Handelman, Maxwell Dubow, Aaron Eliscu

Setup

library(tidyverse)
library(knitr)
library(kableExtra)
library(ggplot2)
library(plotly)
library(gganimate)
library(gifski)
library(png)
library(countrycode)
library(broom)

life_expectancy <- read_csv("Data/lex.csv")
coal_per_cap <- read_csv("Data/coal_consumption_per_cap.csv",
           na = c("NA", ""))

Pivoting/Joining the Data

life_expectancy <- life_expectancy |>
  pivot_longer(cols = 2:last_col(),
    names_to = "year",
    values_to = "infant_lex") |>
  drop_na(infant_lex)

coal_per_cap <- coal_per_cap |>
  mutate(across(.cols = 2:last_col(), as.double)) |>
  pivot_longer(cols = 2:last_col(),
               names_to = "year",
               values_to = "coal_cons_per_cap") |>
  drop_na(coal_cons_per_cap)

lex_coal <- coal_per_cap |>
  left_join(life_expectancy, by = c("country", "year")) |>
  filter(!is.na(coal_cons_per_cap), !is.na(infant_lex)) |>
  mutate(year = as.integer(year),
         continent = countrycode(sourcevar = country,
                                 origin    = "country.name",
                                 destination = "continent")) 

Data Description

In this investigative report we will be exploring the relationship between two quantitative variables across countries/years:

The data on coal consumption per capita comes from BP Statistical Review’s report on global energy markets over time. The data we are interested in is the amount of coal consumption per country per year. The data set, which we downloaded from Gapminder contains countries as observations and coal consumption in a specific year as variables. We transformed the data so that coal consumption was the only variable and each observation was a country and year.

The data on infant life expectancy comes from Gapminder’s own data collection from various different sources. It contains information on the life expectancy at birth in each country and year. The data set, which we downloaded from Gapminder contains countries as observations and infant life expectancy in a specific year as variables. We transformed the data so that life expectancy was the only variable and each observation was a country and year.

Finally, we combined these two data sets so that for each observation (country and year), we had coal consumption and infant life expectancy as our two variables. There were a few missing values in the data, which we ignored in our analysis.

Hypothesis

We hypothesize that infant life expectancy will have a negative association with coal consumption after adjusting for year. This is because burning coal releases many by products into the atmosphere that form toxic chemicals, and “continuous inhalation of these hazardous substances triggers many diseases such respiratory and cardiovascular disease, systemic inflammation, and neurodegeneration” (Gasparotto and Da Boit Martinello 2021). However, with the elimination of a lot of the infectious disease deaths that affected young children and the better understanding of treading cardiovascular conditions and cancer, life expectancy has drastically increased each year (Crimmins 2015). Therefore, we will have to adjust for this effect in our analysis to see if there is a negative association between coal consumption and infant life expectancy like we expect.

Data Visualization

lex_coal_cont <- lex_coal |>
  filter(!is.na(coal_cons_per_cap), !is.na(infant_lex)) |>
  
  group_by(continent, country) |>
  summarize(mean_coal = mean(coal_cons_per_cap),
            mean_infant = mean(infant_lex),
            .groups = "drop") |>
  filter(mean_coal != 0 & mean_infant != 0) |>
  mutate(log_mean_coal = log(mean_coal),
         log_mean_infant = log(mean_infant))
  
p <- lex_coal_cont |>
  ggplot(mapping = aes(x = log_mean_coal, 
                       y = log_mean_infant, 
                       label = country,
                       color = continent,
                       text = paste0("<b>Country:</b> ", 
                                     country, 
                                     "<br>"))) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  # geom_text(check_overlap = TRUE) +
  labs(x = "Log Mean Coal Consumption (tonnes oil equivalent)",
       y = "Log Infant Life Expectancy (years)",
       title = "Infant Life Expectancy vs Coal Consumption by Country") + theme(axis.title = element_text(size = 14),
                                                                                plot.title = element_text(hjust = 0.5))

ggplotly(p, tooltip = "text")
p_anim <- ggplot(lex_coal, aes(x = coal_cons_per_cap, y = infant_lex)) +
  geom_point(alpha = 0.7, size = 2, color = "#2c3e50") +
  # For raw‐scale axes, comment these out; 
  # for log log, uncomment them:
  # scale_x_continuous(trans = "log10") +
  # scale_y_continuous(trans = "log10") +
  labs(
    title = "Infant Life Expectancy vs. Coal Consumption",
    subtitle = "Year: {frame_time}",
    x = "Coal Consumption per Capita (tonnes oil equivalent)",
    y = "Infant Life Expectancy (years)"
  ) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal(base_size = 14) +
  theme(
    plot.subtitle = element_text(size = 16, face = "bold"),
    legend.position = "right"
  ) +
  transition_time(year) +
  ease_aes("linear")

anim <- animate(
  p_anim,
  nframes = length(unique(lex_coal$year)) * 5,
  fps = 10,
  width = 800,
  height = 600,
  renderer = gifski_renderer(loop = TRUE)
)
Error in device(files[i], width = 800, height = 600, units = "in", res = 192): unable to start png() device
anim
Error: object 'anim' not found
anim_save("coal_infant_by_continent.gif", animation = anim)
Error: object 'anim' not found

Animated Relationship

Linear Regression Model

model <- lm(log_mean_infant ~ log_mean_coal, data = lex_coal_cont)
coefs <- broom::tidy(model)

knitr::kable(
  coefs,
  digits  = 3,
  caption = "Regression Coefficients\n(log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)",
  col.names = c("Term", "Estimate", "Std. Error", "t value", "p value")
) |>
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Regression Coefficients (log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)
Term Estimate Std. Error t value p value
(Intercept) 4.304 0.010 417.595 0
log_mean_coal 0.017 0.004 4.329 0

Model Fit

var_response <- var(lex_coal_cont$log_mean_infant)
var_fitted <- var(fitted(model))
var_residuals <- var(residuals(model))
r_squared <- var_fitted / var_response
summary_table <- tibble(
  Metric = c("Variance of Response (A)", 
             "Variance of Fitted Values (B)", 
             "Variance of Residuals", 
             "Model R² (B/A)"),
  Value = c(var_response, var_fitted, var_residuals, r_squared)
)

summary_table |>
  kable(format = "html", digits = 4, caption = "Regression Model Variance Summary") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
Regression Model Variance Summary
Metric Value
Variance of Response (A) 0.0053
Variance of Fitted Values (B) 0.0011
Variance of Residuals 0.0042
Model R² (B/A) 0.2042

##Analysis

In our model, our R-squared value is 0.2042. As a result, approximately 20.4% of the variability in log infant mortality rate is explained by the log of coal exposure. Since our p value is below 0.05, we are able to say that the relationship between infant mortality rate and coal exposure is statistically significant. In conclusion, our model explains about one fifth of the variability in infant mortality.

References

Crimmins, Eileen M. 2015. “Lifespan and Healthspan: Past, Present, and Promise.” The Gerontologist 55 (6): 901–11. https://doi.org/10.1093/geront/gnv130.
Gasparotto, Juciano, and Kátia Da Boit Martinello. 2021. “Coal as an Energy Source and Its Impacts on Human Health.” Energy Geoscience 2 (2): 113–20. https://doi.org/10.1016/j.engeos.2020.07.003.